Optimize MongoDBExportPartitionSupplier for uniform _id type collections #6910
✅ License Header Check Passed. All newly added files have proper license headers. Great work! 🎉
import static org.mockito.Mockito.when;

@ExtendWith(MockitoExtension.class)
public class MongoDBExportPartitionSupplierIsUniformIdTypeTest {
Following existing conventions, add an underscore for clarity: MongoDBExportPartitionSupplier_IsUniformTypeTest. Also, make this class package-private (remove the public modifier).
 * If uniform, we can use a simple Filters.gt() instead of the complex $or query across all BSON types.
 */
boolean isUniformIdType(final MongoCollection<Document> col) {
    final Document first = col.find().projection(ID_PROJECTION).sort(ID_ASC).limit(1).first();
Can these two be combined to avoid two network calls?
DocumentDB doesn't support $facet aggregation, so the first and last documents can't be fetched in one query. The two queries are both indexed _id lookups (ascending limit 1, descending limit 1), and each takes <1 ms.
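For readers following the thread, here is a minimal sketch of why checking only the first and last _id can be enough. It assumes the index orders values by BSON type bracket before comparing values, with all numeric types in one bracket, as MongoDB's comparison order does. The class and method names are hypothetical, plain Java objects stand in for BSON values, and the null handling is an assumption rather than the PR's behavior.

```java
import java.math.BigDecimal;

// Sketch only: names are hypothetical and this is not the PR's code.
final class UniformIdTypeSketch {
    // Put every numeric value in one group, loosely mirroring the numeric
    // grouping the PR attributes to BsonHelper.isClassNumber().
    static String typeGroup(final Object id) {
        if (id instanceof Number) {
            return "number";
        }
        return id.getClass().getName();
    }

    // firstId: _id of the first document in ascending order.
    // lastId: _id of the first document in descending order.
    static boolean isUniformIdType(final Object firstId, final Object lastId) {
        if (firstId == null || lastId == null) {
            // Assumption: a missing boundary value is treated as not uniform.
            return false;
        }
        return typeGroup(firstId).equals(typeGroup(lastId));
    }

    public static void main(final String[] args) {
        // BigDecimal stands in for Decimal128 in this sketch.
        System.out.println(isUniformIdType(3.14, new BigDecimal("99.99")));
        System.out.println(isUniformIdType(1, "z"));
    }
}
```

If both ends of the sorted index fall in the same type group, every value between them should too, which is what lets the caller choose the simple $gt filter.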
final Object gteValue = startDoc.get("_id");
final String gteClassName = gteValue.getClass().getName();

final Document endDoc = col.find(Filters.gte("_id", gteValue))
Maybe name this endOfPageDoc or something similar for clarity.
        .thenReturn(new Document("_id", 3.14))
        .thenReturn(new Document("_id", Decimal128.parse("99.99")));
    assertThat(supplier.isUniformIdType(collection), is(true));
}
Maybe also include a test case that pairs a real number type like double with an integer type.
// isUniformIdType: col.find() called twice (first asc, last desc)
// then col.find() for last doc when endDoc is null
when(col.find()).thenReturn(uniformCheckFirst, uniformCheckLast, lastDocIterable);
It would be better to use thenAnswer and inspect the input to determine which value to return. As written, the stub couples the order of return values here to the call order in the implementation, and that coupling need not exist.
For collections with uniform _id types, replace the 8-clause $or query
with a simple Filters.gt("_id", value) for finding partition boundaries.
This allows DocumentDB to use a single B-tree index seek instead of
a multi-index scan.
Changes:
- Add isUniformIdType() that checks first/last doc _id types
- Add buildNextStartFilter() with simple $gt for uniform types,
  falling back to $or-based query for mixed types
- Use fresh Filters.gte() + skip() per iteration for partition end
- Extract addPartition() helper to reduce duplication
- Make BsonHelper.isClassNumber() public for numeric type grouping
Performance: 14M docs (10GB) partitioned in ~30 seconds.
Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>
Description
For collections with uniform _id types, replace the $or query with a simple
Filters.gt("_id", value) for finding partition boundaries. This allows DocumentDB to use a single B-tree index seek instead of a multi-index scan.
Changes:
Performance: 14M docs (10GB) partitioned in ~30 seconds.
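To illustrate the boundary walk described above, here is a rough sketch in which an in-memory TreeSet stands in for the _id index. It is not the PR's implementation: the class and method names are hypothetical, and the exact skip count (partition size minus one) is an assumption. The tailSet iteration plays the role of Filters.gte() plus skip(), and higher() plays the role of the simple Filters.gt() lookup for the next start.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Sketch only: names are hypothetical and a TreeSet replaces the collection.
final class PartitionWalkSketch {
    // Returns {start, end} pairs that cover every id, each spanning at most `size` ids.
    static List<long[]> partitions(final TreeSet<Long> ids, final int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        final List<long[]> result = new ArrayList<>();
        Long start = ids.isEmpty() ? null : ids.first();
        while (start != null) {
            // Analogue of a gte(start) query with skip(size - 1) and limit(1);
            // if fewer than `size` ids remain, the last id becomes the end.
            Long end = null;
            final Iterator<Long> it = ids.tailSet(start, true).iterator();
            for (int i = 0; i < size && it.hasNext(); i++) {
                end = it.next();
            }
            result.add(new long[] {start, end});
            // Analogue of a gt(end) query with limit(1): a single ordered seek.
            start = ids.higher(end);
        }
        return result;
    }

    public static void main(final String[] args) {
        final TreeSet<Long> ids = new TreeSet<>();
        for (long i = 1; i <= 10; i++) {
            ids.add(i);
        }
        for (final long[] p : partitions(ids, 4)) {
            System.out.println(p[0] + " .. " + p[1]);
        }
    }
}
```

Each iteration issues one seek for the end and one for the next start, which is the shape the uniform-type path is meant to preserve.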
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.